Statistical Disclosure Control

Contact:

Peter-Paul de Wolf
Statistics Netherlands
P.O. Box 24500
2490 HA The Hague
The Netherlands
Phone: +31 70 337 5060

Last update: 10 Oct 2011

Microdata: new disclosure risk assessment methodology (WP 1.2)

Leading partner: URV

Participating partners: Soton, IStat

This workpackage is devoted to research on assessing disclosure risk for microdata at the individual record level. The work focuses mainly on unperturbed microdata (e.g. microdata obtained by sampling a population of records), which is the kind of microdata currently released by many major statistical offices (e.g. the U.S. Bureau of the Census and ONS). Unlike WP 1.1, this workpackage does not deal with the development of new SDC masking methods. The two workpackages are therefore complementary, and both aim to improve µ-ARGUS. The objectives of this workpackage, broken down by task, are as follows:

Task T1 (responsible Soton)

Objectives

Disclosure risk can be measured at either the file level or the record level. Record-level measures are useful in conjunction with disclosure limitation methods that are applied at the record level, for example local suppression. The objectives of this task are:
- To extend the methods proposed by Skinner and Holmes (see references in the Description of Work below) to allow for misclassification of the key variables
- To investigate the application of record-linkage ideas (see references in the Description of Work below) to record-level measures of disclosure risk
- To investigate record-level measures of risk within the framework for µ-ARGUS

Description of the work

Skinner and Holmes (1998) consider records r with key variable values x(r) in the microdata and corresponding external units r* with values x(r*). Writing r = r* if record r belongs to external unit r*, a measure of risk is
Pr(r = r* | x(r), x(r*))     (a)
if all population units s* are included in the microdata file with equal probability. In the case of no misclassification this probability reduces to 1/Fx(r), where Fx(r) is the number of units in the population with key variable values x(r). When measurement error is present, the terms in (a) may be expressed in terms of misclassification probabilities. The aim is to develop such measures, extending the approach of Skinner and Holmes (1998) and drawing on the theory of misclassification (Kuha and Skinner, 1997) and record linkage (Copas and Hilton, 1990; Winkler, 1998).
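The no-misclassification case can be illustrated with a short Python sketch. It assumes, purely for illustration, that the key variable combinations of the whole population are known, which is rarely true in practice; the function name and the example data are hypothetical.

```python
from collections import Counter


def record_level_risk(population_keys, record_key):
    """Return 1/F, where F is the number of population units sharing record_key."""
    population_count = Counter(population_keys)[record_key]
    if population_count == 0:
        raise ValueError("key combination does not occur in the population")
    return 1.0 / population_count


# Hypothetical population of key-variable combinations (sex, age band, occupation).
population = [
    ("F", "30-39", "teacher"),
    ("F", "30-39", "teacher"),
    ("M", "40-49", "farmer"),
    ("F", "30-39", "teacher"),
    ("M", "20-29", "nurse"),
    ("F", "30-39", "teacher"),
]

# Four units share this combination, so the risk is 1/4.
shared_risk = record_level_risk(population, ("F", "30-39", "teacher"))
# A combination held by a single unit has risk 1.
unique_risk = record_level_risk(population, ("M", "40-49", "farmer"))
```

When only a sample is available, Fx(r) is unknown and has to be estimated, which is what the model-based measures discussed in this task address.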

These measures will depend on specified assumptions about the nature and degree of misclassification, both in the microdata and in the external data. In the absence of measurement error, Skinner and Holmes (1998) consider a simple measure of risk for records that are unique in the sample with respect to some categorical key variables. The measure is given by exp(-(1-π)f/π), where π is the sampling fraction and f is a fitted frequency for the combination of key variable values of the given record. The measure may be interpreted as the estimated probability that the variable combination is unique in the population. This is the simplest measure that will be considered in the framework of µ-ARGUS. Computing the fitted frequencies would require iterative proportional fitting. The measure could be extended to records that are not unique in the sample.
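The measure exp(-(1-π)f/π) can be illustrated with a short Python sketch. It assumes a two-way table of sample counts and obtains the fitted frequencies by iterative proportional fitting of the independence (main-effects) model. The choice of model, the function names and the example table are assumptions made for illustration; they are not taken from Skinner and Holmes (1998) or from µ-ARGUS.

```python
import math


def ipf_independence(observed, max_iter=100, tol=1e-10):
    """Fit the independence model to a two-way table by iterative proportional fitting."""
    n_rows, n_cols = len(observed), len(observed[0])
    row_targets = [sum(row) for row in observed]
    col_targets = [sum(observed[i][j] for i in range(n_rows)) for j in range(n_cols)]
    fitted = [[1.0] * n_cols for _ in range(n_rows)]
    for _ in range(max_iter):
        # Scale each row so that it matches the observed row total.
        for i in range(n_rows):
            total = sum(fitted[i])
            if total > 0:
                factor = row_targets[i] / total
                fitted[i] = [value * factor for value in fitted[i]]
        # Scale each column so that it matches the observed column total.
        max_change = 0.0
        for j in range(n_cols):
            total = sum(fitted[i][j] for i in range(n_rows))
            if total > 0:
                factor = col_targets[j] / total
                for i in range(n_rows):
                    new_value = fitted[i][j] * factor
                    max_change = max(max_change, abs(new_value - fitted[i][j]))
                    fitted[i][j] = new_value
        if max_change < tol:
            break
    return fitted


def uniqueness_risk(fitted_frequency, sampling_fraction):
    """Return exp(-(1 - pi) * f / pi) for a sample-unique record."""
    if not 0 < sampling_fraction <= 1:
        raise ValueError("sampling fraction must lie in (0, 1]")
    return math.exp(-(1 - sampling_fraction) * fitted_frequency / sampling_fraction)


# Hypothetical sample counts for two categorical key variables.
sample_table = [[1, 3], [2, 4]]
fitted_table = ipf_independence(sample_table)

# Cell (0, 0) is unique in the sample; its fitted frequency is 4 * 3 / 10 = 1.2.
risk_small_sample = uniqueness_risk(fitted_table[0][0], sampling_fraction=0.1)
risk_census = uniqueness_risk(fitted_table[0][0], sampling_fraction=1.0)
```

With a sampling fraction of 1 the measure equals 1, since a sample unique is then necessarily a population unique; smaller sampling fractions reduce it, which reflects the protection afforded by sampling.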

References

Copas, J. B. and Hilton, F. J. (1990) Record linkage: statistical models for matching computer records (with discussion). J. Roy. Statist. Soc. A, 287-320.
Kuha, J. T. and Skinner, C. J. (1997) Categorical data analysis and misclassification. In L. Lyberg et al. (eds.) Survey Measurement and Process Quality, Wiley, New York, 633-670.
Skinner, C. J. and Holmes, D. J. (1998) Estimating the re-identification risk per record in microdata. J. Official Statist. 14, 361-372.
Winkler, W. E. (1998) Re-identification methods for evaluating the confidentiality of analytically valid microdata. Research in Official Statistics, 2, 87-104.

Milestones and expected result

- Development of theory for record-level measures under misclassification
- Programming of measures for methodological investigation
- Completion of numerical evaluation of methods
The expected result of the project is an improved set of methods for assessing disclosure risk in microdata.

Task T2 (responsible Soton)

Objectives

- To apply the methods developed in Task T1 to the Labour Force Survey (an example of a survey of EU-wide interest).
- To assess the protection afforded by sampling and measurement error.
- To study how disclosure risk depends on the level of detail of the potential key variables, especially geography and occupation.

Description of the work

The Labour Force Survey will be considered, as a survey of EU-wide interest. The following steps will be carried out:
1. Identify potential sets of key variables.
2. Obtain best estimates of misclassification rates for these key variables from various methodological studies.
3. Determine alternative levels of detail in the key variables.
4. Apply the record-level measures of risk developed in Task T1 to the survey data at the different levels of detail.
5. Assess implications for the use of disclosure limitation methods in the light of the uses of the survey and its different forms of release.
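The effect of the level of detail in the key variables can be illustrated with a short Python sketch that counts sample-unique records before and after coarsening. The region and occupation codes, the truncation-based recoding and the function names are hypothetical; they are not taken from the Labour Force Survey. Counting sample uniques is used here only as a simple stand-in for the record-level measures of Task T1.

```python
from collections import Counter


def count_sample_uniques(records, key_func):
    """Count records whose recoded key combination occurs exactly once in the sample."""
    keys = [key_func(record) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1)


def recode(region_chars, occupation_chars):
    """Return a key function that truncates the codes to a coarser level of detail."""
    return lambda record: (record[0][:region_chars], record[1][:occupation_chars])


# Hypothetical sample: (region code, occupation code); longer codes mean finer detail.
sample = [
    ("NL331", "2341"),
    ("NL332", "2341"),
    ("NL331", "2342"),
    ("NL412", "5120"),
    ("NL413", "5120"),
    ("NL331", "2341"),
]

uniques_fine = count_sample_uniques(sample, recode(5, 4))    # full detail
uniques_coarse = count_sample_uniques(sample, recode(4, 2))  # coarser codes
```

In this toy sample four records are unique at full detail and none are unique after coarsening, which shows how reducing detail in geography and occupation can lower the number of records at risk.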

Milestones and expected result

- Determination of misclassification rates
- Application of record-level measures of risk to data
- Review of disclosure limitation implications
The results of the project are expected to help assess the value of record-level measures of disclosure risk and to provide, through one case study, a model for the evaluation of disclosure risk in other surveys.

Task T3 (responsible IStat)

Objectives

To build into µ-ARGUS the individual disclosure risk approach for complex (hierarchical) microdata, as defined in the Esprit project n° 20462 (SDC), taking advantage of the developments of Task T1.

Description of the work

To improve the capabilities of µ-ARGUS and give the user a wider choice of methodology, the individual unit risk approach will be implemented in µ-ARGUS (individual unit risk is called record-level risk in Tasks T1 and T2). The programs already available in SAS as output of the SDC project will be used as a basis for defining a C procedure to estimate the individual risk. Moreover, efficient protection algorithms that take into account dependencies in the data will have to be developed.
Tasks to be carried out include:
1. Study of the steps to be followed to include the methodology in the software: definition of metadata (in particular key variables and their characteristics with respect to the dependencies), input of the data, specification of the dependence structure, estimation of the individual risk according to the type of dependence structure, identification of all factors that influence risk, and definition of the iterative procedure to obtain a safe file.
2. Preparation of a program flow chart.
3. Migration from SAS to C.
4. Integration with existing µ-ARGUS software.
5. Development of efficient protection algorithms for dependent data.
6. Testing.
7. Validation.

Milestones and expected result

Implementing the individual risk of disclosure in µ-ARGUS will widen user choice. The resulting evaluation of disclosure risk will enable the user to measure the safety level reached in the microdata file.